The purpose of this document is to provide an overview of how an IRT-based Computerized Adaptive Test works, and simulate a simple CAT.
Load packages, including custom package CATFunctions to
support this analysis.
At a high level, here are the steps for implementing an IRT-based CAT (adapted from Magis, Yan, and von Davier 2017).
Often these are broken into 3 steps:
Before creating and simulating a simple CAT, here are a few notes on some (but not all) of the steps (so the numbering below may appear to skip ahead; that's intentional).
Before we can run a computerized adaptive test, we require an item bank, typically an IRT-calibrated bank. For this demonstration, we will simulate an IRT-calibrated item bank for dichotomously scored items (responses are scored as 0 or 1).
The first step in a CAT is to determine which item(s) to deliver first. Here are a few options:
However, for test security purposes we shouldn't select the same starting item for everyone. If we don't use prior information and everyone's CAT begins at the same initial ability (e.g., \({\theta}_{0} = 0\)), then it's very likely that, for a given IRT-calibrated item bank, only one item would have the greatest information (option a) or the closest difficulty (option b) at that ability level, and that same item would then be used to start every CAT. This is referred to as initial item selection bias, or a cold start problem.
Therefore, if we don’t have any prior information (which we’ll assume for the purpose of this project), we’ll need to introduce some variability into the initial item selection mechanism. For this project, we’ll use option b with some variability baked-in to the initial item selection.
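Since the CATFunctions code isn't shown here, a minimal language-agnostic sketch (in Python) of option b with variability baked in might look like the following. The k-nearest window and the function name are illustrative assumptions, not the package's actual `initial_item` implementation:

```python
import random

def select_initial_item(bs, theta0=0.0, k=5, rng=None):
    """Pick the first item at random from the k items whose difficulty
    is closest to the starting ability theta0. This avoids the cold-start
    problem of always delivering the same single nearest item."""
    rng = rng or random.Random()
    # Rank items by |theta0 - b| and keep the k nearest
    nearest = sorted(range(len(bs)), key=lambda i: abs(theta0 - bs[i]))[:k]
    return rng.choice(nearest)

bs = [-2.0, -1.0, -0.4, 0.1, 0.3, 0.9, 1.8]
item = select_initial_item(bs, theta0=0.0, k=3, rng=random.Random(1))
```

With k = 1 this collapses back to deterministic "closest difficulty" selection; any k > 1 spreads the first item across the k nearest candidates.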
In this CAT demonstration we are using an IRT-calibrated item bank for dichotomously scored items, though other item types could certainly be used, such as ordered polytomous items (items with multiple, ordered scoring levels). Those other models would need to be incorporated into the CAT, and the item-type specific parameters (e.g., ‘step’ or ‘threshold’ parameters for the Partial Credit model) would need to be captured in the item bank, with a way to distinguish, say, the Ordered Polytomous items from Dichotomous items for selection and scoring purposes.
The proficiency estimator can impact the precision and distribution of scores. Common methods are:
For this document, we’ll use MLE to estimate ability after each item.
One drawback of MLE is that the likelihood function cannot be optimized for response patterns with zero variance (all 0’s or all 1’s), and this can become problematic at early stages in the CAT.
For instance, if someone gets the first item wrong, all we know is that their ability level is probably lower than the difficulty parameter of the first item (or thereabouts). A pure MLE estimate after the first item will probably be very low, and if we're basing item selection on the proximity of item difficulty to current ability, then the second item on the test will likely be one of the easiest items in the bank. If the respondent misses the first few items, their MLE ability estimate will be extremely low.
To demonstrate this, let's use the CATFunctions MLE estimation function est_ability_mle, which takes the inputs:

- responses: a vector of item responses
- as: a vector of IRT discrimination (a) parameters
- bs: a vector of IRT difficulty (b) parameters
- cs: a vector of IRT guessing (c) parameters

Let's imagine we answered the first item incorrectly (response = 0). The item's (a, b, c) parameters are c(1, .5, .1).

est_ability_mle(0, 1, .5, .1, kludge = FALSE)$ability_est

## [1] -4.09
A very low ability estimate (-4.09). If we base item selection on this estimate, the next item selected would be the easiest in the bank, perhaps in the neighborhood of -3 logits.
Let’s say we somehow miss the second question too:
est_ability_mle(
  c(0, 0),    # Responses are 0 and 0
  c(1, 1),    # Both 'a' params are 1
  c(.5, -3),  # Item 1 difficulty = 0.5, Item 2 difficulty = -3
  c(.1, .1),  # Both 'c' params are 0.1
  kludge = FALSE
)$ability_est

## [1] -7.58
Again, the ability estimate is very low, and we’d witness similar extreme values if we got both items correct. Therefore, we need some way to adjust the MLE function so item selection early in the CAT isn’t subject to wild swings in ability estimate. Let’s look at the likelihood (or rather, Log-Likelihood) function which is optimized to identify the most likely value of \({\theta}\).
The log-likelihood function for the 3-parameter logistic (3PL) model is given by:
\[ \log L(\theta) = \sum_{i=1}^{n} \left[ y_i \log P_i(\theta) + (1 - y_i) \log (1 - P_i(\theta)) \right] \]
where:
\(\theta\) is the ability parameter,
\(y_i\) is the binary response (1 if correct, 0 if incorrect) to item \(i\), and
\(P_i(\theta)\) is the probability of a correct response to item \(i\), which is defined as:
\[ P_i(\theta) = c_i + (1 - c_i) \frac{1}{1 + \exp(-a_i (\theta - b_i))} \]
In this model:
\(a_i\) is the discrimination parameter for item \(i\),
\(b_i\) is the difficulty parameter for item \(i\), and
\(c_i\) is the guessing parameter for item \(i\).
The log-likelihood function incorporates these probabilities to estimate the ability parameter \(\theta\) by maximizing the likelihood of observing the given responses.
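The 3PL probability and log-likelihood above translate directly into code. Here's a self-contained sketch (in Python, with a crude grid search standing in for a proper optimizer; the function names are illustrative, not CATFunctions internals). Note how an all-incorrect pattern drives the maximum to the lower boundary of the search range, mirroring the extreme estimates shown above:

```python
import math

def p3pl(theta, a, b, c):
    """3PL probability of a correct response to one item."""
    return c + (1 - c) / (1 + math.exp(-a * (theta - b)))

def log_lik(theta, ys, As, Bs, Cs):
    """Log-likelihood of a response pattern ys at ability theta."""
    total = 0.0
    for y, a, b, c in zip(ys, As, Bs, Cs):
        p = p3pl(theta, a, b, c)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return total

def mle(ys, As, Bs, Cs):
    """Crude MLE: maximize the log-likelihood over a bounded theta grid."""
    grid = [g / 100 for g in range(-800, 801)]  # theta in [-8, 8]
    return max(grid, key=lambda t: log_lik(t, ys, As, Bs, Cs))

# A symmetric two-item pattern (one wrong at b = 0.5, one right at
# b = -0.5, no guessing) has its maximum at theta = 0:
theta_hat = mle([0, 1], [1, 1], [0.5, -0.5], [0.0, 0.0])
```

For a zero-variance pattern such as `mle([0], [1], [0.5], [0.1])`, the log-likelihood is strictly decreasing in theta, so the "estimate" is just whatever lower bound the optimizer uses; this is exactly the problem the adjustment factor below addresses.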
One way we can adjust the maximum likelihood estimation is by using an adjustment factor, which adds (or subtracts) a small constant to all values of \(y_i\) in response strings with zero variance when creating the likelihood function. Although our IRT model presumes binary responses of 0 or 1, the likelihood function will take any numerical value for \(y_i\) (whether or not they make sense).
Therefore, by adjusting the response strings with no variance by a little bit, we can restrict our MLE ability estimates until we get some variability in responses.
To do this, the function est_ability_mle has a (default)
argument to employ an adjustment factor (aka kludge) of \(\frac{1}{3\sqrt{n}}\) to each item, where
\(n\) is the number of items in the
response vector. For instance, the response vector
c(0,0,0,0) would have an adjustment factor of \(\frac{1}{3\sqrt{4}} = \frac{1}{6}\) and
therefore the vector would become
c(0.167,0.167,0.167,0.167).
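The adjustment itself is simple to express. A sketch (in Python; `kludge_responses` is a hypothetical name, not the CATFunctions internal):

```python
import math

def kludge_responses(ys):
    """If the response vector has zero variance (all 0s or all 1s),
    shift every response by 1/(3*sqrt(n)) toward the middle;
    otherwise return it unchanged."""
    n = len(ys)
    if len(set(ys)) > 1:
        return list(ys)           # variance present: no adjustment
    delta = 1 / (3 * math.sqrt(n))
    if ys[0] == 0:
        return [delta] * n        # all incorrect: nudge responses up
    return [1 - delta] * n        # all correct: nudge responses down

kludge_responses([0, 0, 0, 0])    # each response becomes 1/6 ≈ 0.167
```

The adjusted vector is then fed to the same log-likelihood, which happily accepts non-binary \(y_i\) values and yields a bounded estimate.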
To demonstrate this method, let’s estimate ability after each of the first 4 items of a CAT. For simplicity, we’ll fix the a and c parameters for each run, and set the ‘b’ parameter for each item to be near the previous ability estimate.
# First item
est_ability_mle(0, 1, 0.5, .1)$ability_est

## [1] -0.55

# [1] 0.5498125; let's set 'b' for Item 2 to .55
# Second item
est_ability_mle(rep(0,2), rep(1,2), c(0.5, .55), rep(.1, 2))$ability_est

## [1] -1.2
# [1] -1.20393; let's set 'b' for Item 3 to -1.20
# Third Item
est_ability_mle(rep(0,3), rep(1,3), c(0.5, .55, -1.20), rep(.1, 3))$ability_est

## [1] -2.88
# [1] -2.882119; let's set 'b' for Item 4 to -2.88
# Fourth Item
est_ability_mle(rep(0,4), rep(1,4), c(0.5, .55, -1.20, -2.88), rep(.1, 4))$ability_est

## [1] -5.05
As we can see, this approach really restricts the ability estimates when we have no variability in the response pattern. Once the test-taker provides a response that introduces variability (in this example, answering the 5th item correctly), the adjustment factor is ignored and the MLE estimate is based solely on the actual responses. Let's see what happens if they get the next 3 items correct.
est_ability_mle(c(0,0,0,0,1), rep(1,5), c(0.5, .55, -1.20, -2.88, -5.05), rep(.1, 5))$ability_est

## [1] -4.24
est_ability_mle(c(0,0,0,0,1,1), rep(1,6), c(0.5, .55, -1.20, -2.88, -5.05, -4.24), rep(.1, 6))$ability_est

## [1] -3.57
est_ability_mle(c(0,0,0,0,1,1,1), rep(1,7), c(0.5, .55, -1.20, -2.88, -5.05, -4.24, -3.57), rep(.1, 7))$ability_est

## [1] -3.08
There we go, our estimates are coming back down to earth.
This is the step in which we determine whether the CAT should end. There are four main stopping rules that are commonly considered:
For our demonstration, we'll use both precision and length as stopping criteria: we'll stop the test once (a) the standard error of our ability estimate falls below a pre-defined cutoff, or (b) the test reaches a maximum length (we want to limit testing time).
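That combined rule amounts to a few lines of logic. A sketch (in Python; this is an illustrative stand-in for the `stop_test` function used later, not its actual implementation):

```python
def stop_test(n_items_administered, current_se, max_items=20, min_se=0.5):
    """Stop once precision is reached (SE below the cutoff) or the test
    hits its maximum length, whichever comes first."""
    if current_se is not None and current_se < min_se:
        return True                            # precision criterion met
    return n_items_administered >= max_items   # length criterion

stop_test(1, 1.71)    # early and imprecise: keep testing
stop_test(5, 0.42)    # SE below 0.5: stop
stop_test(20, 0.80)   # hit maximum length: stop
```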
We have a few options for selecting subsequent items in a CAT:
Maximum Fisher Information (MFI): Item with most information at the current ability estimate. \[ i_t^* = \arg \max_{i \in S_t} I_i(\hat{\theta}_{t-1}(X_{t-1})) \]
bOpt Criterion, or Urry’s Rule: Item with the difficulty nearest the current ability estimate. \[ i_t^* = \arg \min_{i \in S_t} \left| \hat{\theta}_{t-1}(X_{t-1}) - b_i \right| \] - Note this will be the same as MFI for Rasch and 2PL models
Maximum Likelihood Weighted Information (MLWI): Weights the information by the likelihood function of the currently administered response pattern. Addresses the issue of MFI being severely biased in early stages of the CAT. \[ i_t^* = \arg \max_{i \in S_t} \int_{-\infty}^{+\infty} L(\theta | X_{t-1}) I_i(\theta) \, d\theta \]
Maximum Posterior Weighted Information (MPWI) \[ i_t^* = \arg \max_{i \in S_t} \int_{-\infty}^{+\infty} f(\theta) L(\theta | X_{t-1}) I_i(\theta) \, d\theta \]
Legend of Terms:
\(i_t^*\): Selected item at step \(t\)
\(S_t\): Set of eligible items at step \(t\)
\(I_i(\theta)\): Item information function for item \(i\)
\(\hat{\theta}_{t-1}(X_{t-1})\): Current provisional ability estimate based on the current response pattern \(X_{t-1}\)
\(b_i\): Difficulty level of item \(i\)
\(L(\theta | X_{t-1})\): Likelihood function given response pattern \(X_{t-1}\)
\(f(\theta)\): Prior distribution of ability (e.g., standard normal distribution)
There are many other selection methods we could use as well. For our purpose, let’s just select the easiest one to implement now: bOpt Criterion, since the only values needed are the current ability estimate (\(\hat{\theta}_{t-1}(X_{t-1})\)) and eligible item locations (\(b_i\)).
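The bOpt rule reduces to a single argmin over the eligible items. A sketch (in Python; `next_item_bopt` is a hypothetical name, not the package's `next_item`):

```python
def next_item_bopt(theta_hat, bs, administered):
    """Urry's rule: among eligible (not-yet-administered) items, pick
    the one whose difficulty b is closest to the current ability
    estimate theta_hat."""
    eligible = [i for i in range(len(bs)) if i not in administered]
    return min(eligible, key=lambda i: abs(theta_hat - bs[i]))

bs = [-1.5, -0.5, 0.0, 0.4, 1.2]
next_item_bopt(0.325, bs, administered={2})  # nearest eligible b is 0.4
```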
Now that we have an overview of how an IRT-based CAT works, let’s summarize some of the decisions we’ve noted for this current simulation.
Now that we know how we will set up our simulation, let's make it happen.
Let’s simulate a 3pl item bank using the
generate_item_bank function.
# Number of items
n_items <- 500
# Set seed (for random params)
set.seed(015)
# Rasch, 1pl, 2pl, or 3pl
item_type = "3pl"
# Generate an item bank
item_bank <- generate_item_bank(n_items, model = item_type)

And let's visualize our item bank characteristics: IIFs, TIF, ICCs, and parameter distributions.
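The bank-generation step above can be sketched in a few lines. This toy version (in Python) uses assumed parameter distributions for illustration; they are not the actual defaults of CATFunctions::generate_item_bank:

```python
import random

def generate_item_bank(n_items, seed=None):
    """Toy 3PL bank: a drawn lognormal (positive), b ~ N(0, 1),
    c uniform on [0, 0.25]. Distributions are illustrative assumptions."""
    rng = random.Random(seed)
    bank = []
    for item_id in range(1, n_items + 1):
        bank.append({
            "item_id": item_id,
            "a": rng.lognormvariate(0, 0.3),  # discrimination, > 0
            "b": rng.gauss(0, 1),             # difficulty
            "c": rng.uniform(0, 0.25),        # guessing
        })
    return bank

bank = generate_item_bank(500, seed=15)
```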
Given no prior information about the test-taker, we’ll select an
initial item for administration using the initial_item
function, given the item_bank dataframe we created earlier.
The function calculates weights for each item based on the distance from
the initial ability estimate. The returned data frame (which we’ll save
as test_event) includes placeholders for response data and
the current ability estimate. See ?initial_item for more
information.
## order item_id a b c response_score current_ability
## 383 1 383 1.24 0.0147 0.176 NA NA
## current_ability_se item_selection_ts response response_ts
## 383 NA 2024-07-18 22:27:45 NA <NA>
The first item selected is item_id = 383.
Use the score_response function to score this item.
Let’s assume we get the item correct.
(test_event <- score_response(
test_event_df = test_event,
item_id = test_event$item_id[nrow(test_event)], # Item ID
response = 1 # 1 = Correct, 0 = Incorrect
))

## order item_id a b c response_score current_ability
## 383 1 383 1.24 0.0147 0.176 1 0.325
## current_ability_se item_selection_ts response response_ts
## 383 1.71 2024-07-18 22:27:45 1 2024-07-18 22:27:45
Given that we got the item correct, our new ability estimate is 0.325, with a standard error of the estimate of 1.705. By default this score_response function uses the MLE kludge we mentioned earlier. If we didn’t use that kludge, here’s what the test_event table would look like:
(score_response(
test_event_df = test_event,
item_id = test_event$item_id[nrow(test_event)], # Item ID
response = 1, # 1 = Correct, 0 = Incorrect
kludge = FALSE
))

## order item_id a b c response_score current_ability
## 383 1 383 1.24 0.0147 0.176 1 3.89
## current_ability_se item_selection_ts response response_ts
## 383 9.93 2024-07-18 22:27:45 1 2024-07-18 22:27:45
An ability estimate of 3.89, with an SE of 9.93… Yes, let's use that kludge moving forward. Again, it will only affect item selection until there is variance in the response pattern (someone with a zero score gets an item correct, or someone with a perfect score misses an item).
Next, we’ll check to see if our stopping criteria has been met. Since we haven’t set stopping criteria, let’s do that now.
Based on those, we’ll use the stop_test function to
evaluate our test_event table against our criteria.
- TRUE means the criteria has been met; stop the test
- FALSE means the criteria has not been met; continue the test

stop_max_items <- 20
stop_min_se <- 0.5
stop_test(test_event_df = test_event,
max_items = stop_max_items,
min_se = stop_min_se)

## [1] FALSE
Don’t stop test. Keep moving.
eligible_items <- update_eligible_items(eligible_items_df = item_bank,
test_event_df = test_event)
paste("Of", nrow(item_bank), "items in the bank,", nrow(eligible_items), "are eligible for selection.")

## [1] "Of 500 items in the bank, 499 are eligible for selection."
The "Urry's Rule" selection criterion selects the item whose difficulty parameter is closest to the test-taker's current ability estimate.
set.seed(123)
(test_event <- next_item(eligible_items_df = eligible_items,
                         test_event_df = test_event))

## order item_id a b c response_score current_ability
## 383 1 383 1.24 0.0147 0.1762 1 0.325
## 316 2 316 1.40 0.3279 0.0207 NA NA
## current_ability_se item_selection_ts response response_ts
## 383 1.71 2024-07-18 22:27:45 1 2024-07-18 22:27:45
## 316 NA 2024-07-18 22:27:45 NA <NA>
And at this point, we could just keep running this, changing our “answer” to 0 or 1, until the stopping criteria is met, and the test ends.
# Answer to the current question
# 0 = Incorrect, 1 = Correct
answer <- 0
# Score response
test_event <- score_response(test_event_df = test_event,
item_id = test_event[nrow(test_event),"item_id"],
response = answer)
# Check Stopping Criteria
if(stop_test(test_event_df = test_event,
max_items = stop_max_items,
min_se = stop_min_se) == FALSE) {
# If stopping criteria hasn't been met, update Eligible items
eligible_items <- update_eligible_items(eligible_items_df = eligible_items,
test_event_df = test_event)
# And select the next item.
(test_event <- next_item(eligible_items_df = eligible_items,
test_event_df = test_event))
} else {
# If the stopping criteria has been met, end the test.
print(test_event)
"The test is complete!"
}

## order item_id a b c response_score current_ability
## 383 1 383 1.24 0.0147 0.1762 1 0.325
## 316 2 316 1.40 0.3279 0.0207 0 -0.168
## 80 3 80 1.66 -0.1786 0.1252 NA NA
## current_ability_se item_selection_ts response response_ts
## 383 1.71 2024-07-18 22:27:45 1 2024-07-18 22:27:45
## 316 1.10 2024-07-18 22:27:45 0 2024-07-18 22:27:45
## 80 NA 2024-07-18 22:27:45 NA <NA>
Now that we have our CAT working, let’s set it up and simulate for a few hundred people to check how the CAT functions.
We’ll use the same item bank as in our example:
item_bank
Let’s simulate our CAT for 500 people, using the same stopping criteria as before.
# Number of test takers
n_people <- 500
# Define the seed outside of the function
seed = 123
# Stopping criteria, restated
stop_max_items <- 20
stop_min_se <- 0.5
# Create a set of ability estimates
sample_abilities <- rnorm(n_people,
mean = 0,
sd = 1)

When running this simulation, however, the consistency of responses will impact how the CAT performs. For instance, we could simulate every respondent ('sim') answering exactly as expected based on their actual ability \({\theta}\) and the b-parameter of item j, \(b_{j}\): if \(b_{j} < {\theta}\), answer correctly; if \({\theta} < b_{j}\), answer incorrectly. However, this type of highly consistent responding is not typical; in practice, responses will have some degree of inconsistency.
To accommodate this in our simulation, the function simulate_cat includes a response_consistency argument used when simulating responses. The function simulates each response by drawing from a binomial (Bernoulli) distribution whose success probability comes from the 3PL probability function for a given ability and an item's a, b, and c parameters. The response_consistency value multiplies the 'a' parameter, making the item response function steeper, so that a draw from the binomial distribution will be more consistent with the sim's ability, \({\theta}\). The default response_consistency is 1, which leaves the probability function unchanged; values above 1 result in more consistent responding, and values between 0 and 1 result in less consistent responding.
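The response-generation mechanism can be sketched in a few lines (in Python; an illustrative stand-in for simulate_cat's internal response draw, not its actual code):

```python
import math
import random

def simulate_response(theta, a, b, c, response_consistency=1.0, rng=None):
    """Draw a 0/1 response from a Bernoulli trial whose success probability
    is the 3PL curve, with 'a' scaled by response_consistency. A larger
    multiplier steepens the curve, so responses track the sim's true
    ability more closely."""
    rng = rng or random.Random()
    p = c + (1 - c) / (1 + math.exp(-a * response_consistency * (theta - b)))
    return 1 if rng.random() < p else 0

# With very high consistency, a sim well above an item's difficulty
# almost always answers correctly:
rng = random.Random(42)
hits = sum(simulate_response(1.0, 1.0, 0.0, 0.0, response_consistency=50, rng=rng)
           for _ in range(100))
```

With response_consistency = 1 the same call reproduces ordinary probabilistic responding, where a sim one logit above an item's difficulty still misses it a fair share of the time.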
To demonstrate the difference in simulated response consistency,
we’ll run the simulation for two different
response_consistency levels: 1 and 5.
# Run the simulation with less consistent responding
test_cat_consistency1 <- simulate_cat(item_bank = item_bank,
abilities = sample_abilities,
seed = seed,
max_items = stop_max_items,
min_se = stop_min_se,
response_consistency = 1,
silent = TRUE)
# Run the simulation with very consistent responding
test_cat_consistency5 <- simulate_cat(item_bank = item_bank,
abilities = sample_abilities,
seed = seed,
max_items = stop_max_items,
min_se = stop_min_se,
response_consistency = 5,
silent = TRUE)

## n mean median sd min max trimmed
## ability 500 0.0346 0.0207 0.9728 -2.661 3.24 0.0252
## n_items 500 14.6940 15.0000 1.9432 9.000 20.00 14.6000
## final_ability 500 0.0739 0.1055 1.0764 -3.075 3.55 0.0737
## final_ability_se 500 0.4876 0.4889 0.0101 0.447 0.55 0.4884
## test_info_at_final_ability 500 3.0861 3.1366 0.3827 1.614 4.05 3.1104
## residual 500 -0.0393 -0.0254 0.5611 -1.766 1.60 -0.0266
## mad range skew kurtosis se
## ability 0.93579 5.902 0.08586 -0.058207 0.043504
## n_items 1.48260 11.000 0.43368 0.182519 0.086901
## final_ability 1.03683 6.621 0.00851 0.138910 0.048136
## final_ability_se 0.00936 0.103 -0.21047 3.419723 0.000453
## test_info_at_final_ability 0.34905 2.439 -0.71383 0.815729 0.017115
## residual 0.58545 3.366 -0.19926 0.000292 0.025093
## n mean median sd min max trimmed
## ability 500 0.03459 0.0207 0.97277 -2.661 3.241 0.0252
## n_items 500 14.12200 14.0000 1.53356 9.000 18.000 14.1200
## final_ability 500 0.03151 -0.0309 1.00152 -2.764 3.297 0.0149
## final_ability_se 500 0.48721 0.4890 0.00939 0.447 0.500 0.4882
## test_info_at_final_ability 500 3.09034 3.1347 0.32653 1.187 3.906 3.1169
## residual 500 0.00308 0.0145 0.22817 -1.817 0.636 0.0104
## mad range skew kurtosis se
## ability 0.93579 5.9020 0.0859 -0.0582 0.04350
## n_items 1.48260 9.0000 -0.0185 -0.0102 0.06858
## final_ability 0.96599 6.0610 0.1306 -0.0162 0.04479
## final_ability_se 0.00949 0.0534 -0.9362 0.7162 0.00042
## test_info_at_final_ability 0.26961 2.7191 -1.1699 3.1640 0.01460
## residual 0.21059 2.4529 -1.1189 7.4580 0.01020
Although the distributions of estimates are similar, the error associated with the inconsistent group is quite a bit larger. Let's visualize the distributions of final ability estimates for both conditions.
Note: Density of actual abilities is in grey.
Note: Density of actual abilities is in grey.
And let’s visualize the CAT response pattern and ability estimates for a few cases.
Since we used the same ability estimates in our simulations, and the only thing that changed between the two was response consistency, let’s see how those affected how the CAT operated.
Let’s pick the case with the ability closest to 0, case 7.
These plots show a number of things related to this CAT administration:
Let’s take a look at the CAT administrations for a few cases.
This document provided an overview of how a Computerized Adaptive Test works and demonstrated a simple CAT simulation. Key components covered include:
The document then simulated CAT administrations for 500 test-takers under two conditions: consistent and inconsistent responding. Visualizations were provided to compare the performance of the CAT under these conditions, including ability estimation accuracy, test length, and information at ability estimates. Overall, this simulation demonstrated how CATs can efficiently estimate test-taker abilities with fewer items than fixed-form tests, and showed the impact of response consistency on CAT performance. The concepts and code provided serve as a foundation for understanding and implementing basic CAT systems.